Predicting and characterizing the crystal structure of materials is a key problem in materials research 
and development. It is typically addressed with highly accurate quantum mechanical computations 
on a small set of candidate structures, or with empirical rules that have been extracted from a large 
amount of experimental information, but have limited predictive power. In this letter, we transfer 
the concept of heuristic rule extraction to a large library of ab-initio calculated information, and 
demonstrate that this can be developed into a tool for crystal structure prediction. 
Ab-initio methods, which predict materials properties 
from the fundamental equations of quantum mechanics, 
are becoming a ubiquitous tool for physicists, chemists, 
and materials scientists. These methods allow scientists 
to evaluate and pre-screen new materials "in silico", 
rather than through time-consuming experimentation, 
and in some cases, even make suggestions for new 
and better materials [1-5]. One inherent limitation of 
most ab-initio approaches is that they do not make ex- 
plicit use of results of previous calculations when study- 
ing a new system. This can be contrasted with data- 
centered methods, which mine existing data libraries to 
help understand new situations. The contrast between 
data-centered and traditional ab-initio methods can be 
seen clearly in the different approaches used to predict 
the crystal structure of materials. This is a difficult but 
important problem that forms the basis for any rational 
materials design. In heuristic models, a large number of 
experimental observations is used to extract rules that 
rationalize crystal structure with a few simple physical 
parameters such as atomic radii, electronegativities, etc. 
The Miedema rules for predicting compound formation [6], 
and the Pettifor maps [7], which can be used to predict 
the structure of a new binary material by correlating the 
position of its elements in the periodic table to those of 
systems for which the stable crystal structure is known, 
are excellent examples of this. Ab-initio methods differ 
from these data-centered methods 
in that they do not use historic and cumulative infor- 
mation about previously studied systems, but rather try 
to determine structure by optimizing from scratch the 
complex quantum mechanical description of the system, 
either directly (as in ab-initio Molecular Dynamics), or in 
coarse-grained form (as in lattice models [8-10]). Here, 
we merge the ideas of data-centered methods with the 
predictive power of ab-initio computation. We propose 
a new approach whereby ab-initio investigations on new 
systems are informed with knowledge obtained from re- 
sults already collected on other systems. We refer to 
this approach as Data Mining of Quantum Calculations 
(DMQC), and demonstrate its efficiency in increasing the 
speed of predicting the crystal structure of new and unknown 
materials. Using a Principal Component Analysis 
on over 6000 ab-initio energy calculations, we show that 
the energies of different crystal structures in binary alloys 
are strongly correlated between different chemical sys- 
tems, and demonstrate how this correlation can be used 
to accelerate the prediction of new systems. We believe 
that this is an interesting new direction to address in a 
practical manner the problem of predicting the structure 
of materials. 

Using Density Functional Theory we have calculated 
a library of ab-initio energies for 114 different crystal 
structures in each of 55 binary metallic alloys. About 
1/3 of the crystal structures in the library were chosen 
from the most common binary crystal structures in the 
CRYSTMET database for intermetallics [11]. The rest 
are superstructures of the fcc, bcc, and hcp lattices. The 
alloys include all 45 binaries that can be made from row 
4 transition metals, as well as ScAl, AgMg, and 8 bi- 
nary Ti alloys (AgTi, CdTi, MoTi, PdTi, RhTi, RuTi, 
TcTi, TiZr). The formation energy for each structure is 
determined with respect to the most stable structure of 
the pure elements. Energy calculations were done using 
density functional theory, in the local density approxi- 
mation, with the Ceperley-Alder form for the correlation 
energy as parameterized by Perdew-Zunger [12] with ul- 
trasoft pseudopotentials, as implemented in VASP [13]. 
Calculations are at zero temperature and pressure, and 
without zero-point motion. The energy cutoff for an alloy 
was set to 1.5 times the larger of the suggested energy 
cutoffs of the pseudopotentials of its elements 
(suggested energy cutoffs are derived by the method 
described in [13]). Brillouin zone integrations were done 
using 2000/(number of atoms in unit cell) k-points 
distributed as uniformly as possible on a Monkhorst-Pack 
mesh. We verified that with these energy cutoffs and 
k-point meshes the absolute energy is converged to better 
than 10 meV/atom. Energy differences between structures 
are expected to be converged to much smaller tolerances. 
Spin polarization was not used as no magnetic 
alloys were studied. All structures were fully relaxed. 
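
The bookkeeping described above can be sketched in a few lines. This is a minimal illustration and not the scripts used for the library: the function names are hypothetical, the numerical inputs in the example are invented, and rounding the k-point count up to an integer is an assumption.

```python
import math

def formation_energy_per_atom(e_compound, x_b, e_pure_a, e_pure_b):
    """Formation energy per atom of A(1-x)B(x) relative to the pure elements.

    All energies are per atom; e_pure_a and e_pure_b are the energies of the
    most stable structure of each pure element.
    """
    return e_compound - (1.0 - x_b) * e_pure_a - x_b * e_pure_b

def plane_wave_cutoff(suggested_cutoffs, factor=1.5):
    """Cutoff for an alloy: 1.5 times the largest suggested elemental cutoff."""
    return factor * max(suggested_cutoffs)

def kpoint_budget(atoms_in_cell, kpoints_times_atoms=2000):
    """Number of k-points so that (k-points x atoms) is about 2000.

    Rounding up is an assumption; the text only gives the ratio.
    """
    return math.ceil(kpoints_times_atoms / atoms_in_cell)

# Invented numbers, for illustration only (eV/atom and eV).
example_formation = formation_energy_per_atom(-5.2, 0.5, -4.0, -6.0)
example_cutoff = plane_wave_cutoff([180.0, 230.0])
example_kpoints = kpoint_budget(4)
```

A negative formation energy in this convention means the compound is lower in energy than the corresponding mixture of pure elements.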

For each alloy $i$, consider the 114 structural formation 
energies as the components of a vector $E_i$ in a 
114-dimensional space. If the energies of the structures are 
linearly dependent, then the vectors for the alloys will not 
be distributed randomly in the 114-dimensional space, 
but will be confined to a subspace of reduced dimension. To 
look for such approximate linear dependencies we use 
Principal Component Analysis (PCA) [14]. This allows us to 
express the energy vector of an alloy as an expansion in 
a basis of reduced dimension $d$, 
$E_i = \sum_{j=1}^{d} \alpha_{ij} e_j + \epsilon_i(d)$, 
where $\epsilon_i(d)$ is the error vector for alloy $i$. PCA 
consists of finding the basis set $\{e_j(d)\}$ that minimizes 
the remaining squared error $\sum_i \epsilon_i(d) \cdot \epsilon_i(d)$ 
for a given dimension $d$. These optimum basis vectors 
$\{e_j(d)\}$ are called the Principal Components (PC's), and 
form a set of orthogonal vectors ordered by the amount of 
variation of the original sample they can explain. More 
intuitively, they are a new set of axes in the 
114-dimensional space, ordered according to the fraction of 
the data lying along each axis. As an extreme example, if 
the energies of all 55 alloys were proportional to each 
other, then all the alloy vectors would lie along a single 
line, and the first PC would span a subspace that 
encompasses all the data ($d = 1$). 
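
As a rough illustration of this expansion, the sketch below computes the RMS residual left after projecting a set of energy vectors onto the first d principal components. It is not the code used for this work. The data are synthetic (three hidden factors plus noise stand in for the 55 x 114 library), the function name is hypothetical, and no mean-centering is applied, to mirror the expansion as written; a conventional PCA would usually subtract the mean first.

```python
import numpy as np

def pca_rms_error(E, d):
    """RMS residual after expanding each row of E in the first d PCs.

    E has shape (n_alloys, n_structures). The PCs are the leading right
    singular vectors of E (uncentered, which is an assumption here).
    """
    _, _, Vt = np.linalg.svd(E, full_matrices=False)
    basis = Vt[:d]                  # rows are e_j, orthonormal
    coeffs = E @ basis.T            # alpha_ij
    residual = E - coeffs @ basis   # epsilon_i(d)
    return float(np.sqrt(np.mean(residual ** 2)))

rng = np.random.default_rng(0)
# Synthetic stand-in for the library: 55 "alloys", 114 "structures".
factors = rng.normal(size=(55, 3))
loadings = rng.normal(size=(3, 114))
E = factors @ loadings + 0.01 * rng.normal(size=(55, 114))

errors = [pca_rms_error(E, d) for d in range(8)]
```

For data built from three factors, the residual drops to the noise level once d reaches 3, which is the kind of behavior one looks for in a plot of RMS error against the number of components.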
FIG. 1. The RMS error as a function of the number of 
principal components. The solid lines show results for the 
libraries containing 35, 45, and 55 alloys. The dashed line 
shows results for the 55-alloy library where the energies for 
each alloy have been randomly permuted. 

A Principal Components Analysis of our ab-initio data 
set (Figure 1) shows that significant dimension reduction 
is possible in the space of structural energies. The solid 
curve, labelled "55", in Figure 1 shows the remaining 
unexplained Root Mean Square (RMS) error (average error 
in the 114 structural energies of the 55 alloys), as a function 
of the number of principal components d. All quantities 
are given as energy per atom. The number of relevant 
dimensions depends on the error one can tolerate. For 
example, to describe the energies with a 50 meV RMS 
error, only 9 dimensions are required, compared with the 
original 114. The implication is that it is possible to 
perform far fewer than 114 calculations to parameterize the 
9-dimensional subspace, and then derive the other 105 
energies through linear relationships given by the PCA. 

Dimension reduction holds only because the energy dif- 
ferences of structures are strongly correlated between al- 
loys with different chemistry. In fact, if we perform a 
PCA analysis in which the structural energies for each 
alloy are randomly permuted, and hence destroy their 
relations, there is little opportunity for dimension reduc- 
tion, as the dashed curve in Figure 1 shows. Given an 
acceptable accuracy, dimension reduction does not de- 
pend on the dimension of the library, once the library 
is bigger than a certain size. Figure 1 shows the PCA 
analysis for 35, 45, and 55 alloys. For subspaces defined 
by up to ≈ 20 PC's (27 meV RMS accuracy) the variance 
is essentially independent of the number of alloys, 
indicating that the dimension reduction we obtain can be 
expected to apply to new alloy systems. 
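
The permutation control described above can be illustrated on synthetic data. In the sketch below (hypothetical names, invented data, uncentered PCA as an assumption) the structure labels are shuffled independently for each alloy, which keeps every alloy's set of energies but destroys the correspondence between alloys.

```python
import numpy as np

def rms_after_d_components(E, d):
    """RMS residual after projecting rows of E onto the first d PCs."""
    _, _, Vt = np.linalg.svd(E, full_matrices=False)
    basis = Vt[:d]
    return float(np.sqrt(np.mean((E - E @ basis.T @ basis) ** 2)))

rng = np.random.default_rng(1)
# Synthetic correlated "library": 55 alloys, 114 structures, 4 hidden factors.
E = rng.normal(size=(55, 4)) @ rng.normal(size=(4, 114))
E += 0.01 * rng.normal(size=E.shape)

# Shuffle the structure energies independently within each alloy.
E_shuffled = np.array([rng.permutation(row) for row in E])

d = 4
correlated = rms_after_d_components(E, d)
shuffled = rms_after_d_components(E_shuffled, d)
```

With the correlations intact, four components leave only the noise; after shuffling, the same four components leave most of the variation unexplained.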

These correlations are further confirmation that the 
success of heuristic methods is not accidental, and that 
with relatively few parameters it can be possible to predict 
the structure of a binary alloy. In fact, these correlations 
can be used to develop an ab-initio data-mining algorithm 
that rapidly searches through the available space 
of possible structures. 

Given a library of $N_a$ alloys, $N_s$ structures, and a new 
alloy where the first $n$ energies have been calculated, we 
predict the energy for structure $i > n$ of the new alloy 
as follows. Define $X$ as the $(n, N_a)$ matrix of energies 
for structures $\{1 \ldots n\}$ in the library. Define $y$ as the 
$n$-component vector of known energies for the new alloy 
and $X'$ as the $N_a$-component vector of energies for 
structure $i$ for all alloys in the library. The scalar $y'$ 
represents the unknown energy of structure $i$ for the new 
alloy. We regress $y$ on $X$ using the Partial Least Squares 
(PLS) method [15,16] implemented with the SIMPLS algorithm 
[17]. The resulting regression coefficients are used to 
predict $y'$ from $X'$. This is done for every structure of the 
new alloy for which the energy has not yet been calculated. 
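
The regression step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used here: it uses a textbook single-response PLS (PLS1) with deflation rather than the SIMPLS algorithm cited above (for a single response the two are generally understood to give the same fit), it applies no centering or scaling, and the function name and the low-rank synthetic library are invented for the example.

```python
import numpy as np

def pls1_coefficients(X, y, n_components):
    """PLS1 regression coefficients b such that y is approximated by X @ b.

    X: (n, N_a) energies of the n calculated structures in the library alloys.
    y: (n,) energies of the same structures in the new alloy.
    No centering is applied (an assumption made for this sketch).
    """
    Xk = np.array(X, dtype=float)
    y = np.asarray(y, dtype=float)
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ y
        norm = np.linalg.norm(w)
        if norm < 1e-12:          # nothing left to explain
            break
        w = w / norm
        t = Xk @ w                # score vector
        tt = t @ t
        p = Xk.T @ t / tt         # loading vector
        W.append(w)
        P.append(p)
        q.append((y @ t) / tt)
        Xk = Xk - np.outer(t, p)  # deflate X
    if not W:
        return np.zeros(Xk.shape[1])
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

rng = np.random.default_rng(2)
n_structures, n_alloys, rank = 20, 10, 3
F = rng.normal(size=(n_structures, rank))
library = F @ rng.normal(size=(rank, n_alloys))  # rows: structures, columns: alloys
new_alloy = F @ rng.normal(size=rank)            # "true" energies of the new alloy

n = 6                                            # structures already calculated
b = pls1_coefficients(library[:n], new_alloy[:n], n_components=rank)
predicted = library[n:] @ b                      # y' = X' . b for each unknown structure
max_error = float(np.max(np.abs(predicted - new_alloy[n:])))
```

Because the synthetic library is exactly rank 3 and noise-free, three components reproduce the remaining energies essentially exactly; with real data the number of components would have to be chosen with care, for example by cross-validation.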

The ground states for an alloy are found through an it- 
erative scheme. At each step, the PLS regression is used 
to find the most probable ground state, which is then cal- 
culated with quantum mechanics and added to the data. 
The algorithm is started with only the pure element 
energies for the two elements of the alloy in the bcc, fcc, 
and hcp structures, and then proceeds as follows. 
Step 1 (prediction). The regression algorithm given 
above is used to predict all unknown structural energies 
in the new system. We found that for early iterations 
(< 10) the RMS error can be reduced by preclustering the 
library into ordering and phase-separating systems and 
regressing only within the library subcluster in which the 
system is predicted to fall. Physically, this means that in 
the early stages of the iterative procedure, new alloys regress 
better with similar alloys than with the complete library. 
Step 2 (suggestion). With the available calculated energies 
we determine the ground-state energy versus composition 
curve (convex hull). The convex hull, which is 
the set of tie lines that connects the lowest energy ordered 
phases, represents the free energy of the alloy at 
0 K. Structures with energy above a tie line are unstable 
with respect to mixtures of the two structures that define 
the vertices of the tie line. Hence, the convex hull 
determines the phase stability of the system at zero 
temperature. The structure with predicted energy farthest 
below the convex hull of calculated energies is calculated 
with quantum mechanics and added to the database. If 
no structure breaks the hull, we look for the structure 
predicted to be closest to the hull. For early iterations 
(< 13 in Figure 2), if no such structure can be found 
within 80 meV of the ground state hull, we consider the 
prediction to have failed in this step, and instead add the 
most frequent ground state structure of the database that 
has not yet been calculated. 
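
The suggestion step can be sketched with a lower convex hull in the (composition, formation energy) plane. The code below is an illustration only: the function names, the structure labels, and the energies are hypothetical, and the 80 meV fallback described above is left to the caller.

```python
def lower_hull(points):
    """Lower convex hull of (composition, energy) points, left to right."""
    best = {}
    for x, e in points:               # keep the lowest energy per composition
        best[x] = min(e, best.get(x, e))
    hull = []
    for p in sorted(best.items()):
        while len(hull) >= 2:
            (x1, e1), (x2, e2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - e1) - (e2 - e1) * (p[0] - x1)
            if cross <= 0:            # hull[-1] lies on or above the new tie line
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

def hull_energy(hull, x):
    """Energy of the hull at composition x by linear interpolation."""
    for (x1, e1), (x2, e2) in zip(hull, hull[1:]):
        if x1 <= x <= x2:
            return e1 + (e2 - e1) * (x - x1) / (x2 - x1)
    raise ValueError("composition outside the range of the hull")

def suggest_next(calculated, predicted):
    """Return (name, distance) of the structure with the lowest predicted
    energy relative to the hull of calculated energies. A negative distance
    means the structure is predicted to break the hull."""
    hull = lower_hull(calculated)
    distances = {name: e - hull_energy(hull, x)
                 for name, (x, e) in predicted.items()}
    name = min(distances, key=distances.get)
    return name, distances[name]

# Invented example, energies in eV/atom.
calculated = [(0.0, 0.0), (1.0, 0.0), (0.5, -0.30)]
predicted = {"S1": (0.25, -0.10), "S2": (0.25, -0.20), "S3": (0.75, -0.05)}
choice, distance = suggest_next(calculated, predicted)
```

Taking the minimum of the signed distance covers both cases in the text: if some structure is predicted below the hull, the one farthest below is returned; otherwise the one closest to the hull is returned.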

Step 3 (calculation). The candidate suggested struc- 
ture is then calculated with Quantum Mechanics and 
added to the list, and the entire process is iterated (pre- 
diction => suggestion => calculation). With each step, 
more energetic information for the new alloy is incorpo- 
rated and a better prediction of the ground state can be 
expected. 
FIG. 2. Number of calculations as a function of the percentage 
of ground states predicted correctly, with the DMQC 
method (solid line) and with random structure selection 
(dashed line). Ninety percent accuracy can be achieved with 
DMQC with 26 calculations, far fewer than the 98 calculations 
necessary for random structure selection. 

Any structure in the library can be predicted and there 
are no preconceived biases as to the symmetry or underlying 
superlattice of the structure, as is the case for 
methods that work with lattice model approaches. For 
example, in the Ti-Pt alloy, our method correctly finds 
the A15 [18,19] structure to be a ground state for Ti3Pt 
after only 20 steps in the algorithm, even though this 
structure is not a superstructure of fcc (the structure of 
Pt) or hcp (the structure of Ti), and is therefore not 
an obvious structure to investigate for this system. To 
study in a more statistically significant way how this it- 
erative scheme converges we tested how well the library 
minus one alloy can perform predictions on the alloy left 
out. A key property is whether the alloy is immiscible 
(no ordered compounds) or has intermediate compounds 
(compound-forming). Empirical schemes like the one developed 
by Miedema [6] have been particularly successful 
in classifying this difference. We find that DMQC 
can predict whether an alloy excluded from the library 
is compound-forming with 95%, 98%, and 100% accuracy 
using 3, 6, and 13 calculations, respectively. Note 
that here and below, we do not count the initial pure el- 
ement calculations, since these are only performed once 
for each element. For comparison, if one randomly picked 
trial structures from the list of 114 structures, predictions 
with 95%, 98%, and 100% accuracy require 7, 21, and 98 
calculations, respectively. The DMQC method performs 
extremely well, far better than a naive random choice 
of structures, and gives almost perfect prediction with a 
small amount of computation. 

A more stringent evaluation is whether the correct sta- 
ble crystal structures are predicted for the system left 
out. Figure 2 (solid line) shows the number of calcula- 
tions required as a function of the percentage of ground 
states predicted correctly (averaged over all alloys). For 
our purpose, "correct" is what would be obtained from 
the direct quantum mechanical calculations on all 114 
structures. Ninety percent accuracy can be achieved with 
fewer than 26 calculations for an alloy. To achieve the same 
confidence level with random structure selection (dashed 
line) one needs to calculate almost the complete database 
(98 calculations). 

Even though it is generally believed that the binary al- 
loys are well characterized experimentally, our approach 
can be used to quickly predict previously unknown stable 
structures in some systems. For example, with only 26 
calculations we predict Ag3Cd and Ag2Cd to have the 
D0$_{24}$ and C37 structures, respectively. In addition, we 
predict the previously unidentified structure for CdZr3 to 
be A15 (Cr3Si-type). This prediction takes only 21 iterations 
and is particularly interesting since A15 does not 
share the hcp parent lattice of Cd and Zr. These predictions 
were confirmed by calculation of all the structures 
in the library. A more detailed analysis of the predictions 
made from our database in a large number of systems will 
be published elsewhere. 

More structures will need to be added to the library 
to give the method better applicability to many unknown 
systems. It is therefore important to assess how the num- 
ber of required calculations scales with the number of 
structures in the library. This scaling is shown in Figure 
3 for various required confidence levels. As the library 
grows, more calculations are needed to select between 
the increasing number of possibilities. Fortunately, the 
number of calculations increases less than linearly with 
the number of structures in the database, demonstrating 
that efficiency increases as the library grows. 

Our current DMQC approach has the limitation that 
structure types must already be in the database to be 
predicted. However, a concerted effort to develop a public 
database, analogous to those used in biology, may 
make this limitation less important. Our work has also 
focussed on a simple test library of binary alloys. The 
real payoff will come with the inclusion of multicompo- 
nent systems, where fewer than 10% of all intermetal- 
lic systems have been characterized [7,20]. A library of 
ternary structures can be integrated with the binary li- 
braries and extensions of the formalism are not required, 
besides adding an extra composition variable. Although 
the data mining methods discussed here are centered 
around dimension reduction and linear correlation ap- 
proaches, including nonlinear methods (e.g., neural nets, 
clustering, learning machines, etc.) will certainly be more 
effective in extracting information from the library. For 
example, problems associated with using a single set of 
regression coefficients for the whole heterogeneous data 
set can be avoided by preclustering the data and using 
linear regression only within each cluster. 

FIG. 3. The average number of calculations needed to obtain 
a given accuracy of predicted crystal structures, as a function 
of the number of structures in the library (horizontal 
axis: number of structures in historical data). Results are 
given for 80%, 85%, 90%, and 95% accuracies. The number 
of calculations increases less than linearly with the number 
of structures in the database, demonstrating that efficiency 
increases as the library grows. 

In summary, by data mining quantum mechanical cal- 
culations (DMQC) we have established that there exist 
significant correlations among ab-initio energies of differ- 
ent structures in different materials. The correlations we 
found can be seen as a formal extension of the heuristic 
structure-properties selection rules that have been established 
in the past on the basis of large amounts of experimental 
structure information [20-22]. Our approach 
differs from the previous classifications in that we correlate 
on calculated information (structural energies in our 
particular example), and hence our description can be 
used when there is limited experimental data, and can 
be extended to arbitrary accuracy. 

The data-mined correlations form the basis for an ef- 
ficient algorithm for structure prediction which has all 
the capacities of ab-initio energy methods, but extracts 
information from previous calculations on other systems 
in order to efficiently propose candidate structures. Be- 
cause structures are not found through optimization of 
some physical variable space (e.g. atomic coordinates), 
it has none of the problems with time-scale and equili- 
bration common to other approaches. We believe that 
the integration of data mining techniques with ab-initio 
methods is a promising development towards the practi- 
cal prediction of crystal structure. 

The research was supported by the Department of En- 
ergy, Office of Basic Energy Science under Contract No. 
DE-FG02-96ER45571. 